Diving in: Exploring Time Series Data

Part 1

Back to Navigation

Forecasting Air Pollution: Univariate EDA

The next important step in any data science workflow is to use exploratory data analysis to get to know the patterns and structure of our data. For time series, this is especially imortant, as we want to be able to identify any seasonal patterns, trends, and stationarity of our series, in order to help us understand what type of model we should be using to forecast into the future.

Dealing With Thousands of Points

One issue that pops up a lot in data science and data mining is that when you have too many data points, EDA becomes very slow, and your plots become unreadable, as no human can really detect a pattern when there are more points than there are pixels in your plot. So, to combat this, we can make an assumption to guide our analysis:

  1. Hourly data which depends on the weather, and on the flow of people, typically has a daily seasonality (i.e. hourly patterns repeat daily to some degree, due to day and night), a weekly seasonality (due to the work schedule), and a yearly seasonality. This way already know what to look for.

Loading our Data

In Part 1, we already preprocessed our data, and stored it as a hash table. We can simply load that up:

china <- readRDS("store/china.Rds")
beijing <- china$BeijingPM_

Initial plot

Before we get into it, lets make a simple plot of our data:

library(ggplot2)
library(forecast)
library(ggthemes)
autoplot(beijing$PM_US) + 
  theme_hc() + labs(y = "PM2.5")
Preprocessed and Unclean Time Series

Preprocessed and Unclean Time Series

The first thing we notice, immediately, is that we have negative values a few times. This is impossible. These are likely due to issues with A) the sensor (a high likelihood event), B) data input error (less likely, hopefully nobody input this by hand), or C) something weird happening with our spline interpolation (a definite possibility). Lets see how many of these points there are:

library(tidyverse)
colnames(beijing)
#>  [1] "No"              "year"            "month"          
#>  [4] "day"             "hour"            "season"         
#>  [7] "PM_Dongsi"       "PM_Dongsihuan"   "PM_Nongzhanguan"
#> [10] "PM_US.Post"      "DEWP"            "HUMI"           
#> [13] "PRES"            "TEMP"            "cbwd"           
#> [16] "Iws"             "precipitation"   "Iprec"
beijing %>% filter(PM_US.Post < 0) %>% 
  nrow/(nrow(beijing)) * 100
#> [1] 0.4183782
read.csv("data/BeijingPM20100101_20151231.csv") %>% 
  filter(PM_US.Post < 0) %>% 
  nrow/nrow(beijing) * 100
#> [1] 0

Tidying up

So 0.4% of our points are negative, and and they are all due to our interpolation. This is an important thing to check. Luckily, there are very few and they are at a small value, so we can feel safe in takin their absolute value. We will also go ahead and clean away any extreme values in our time series. As we are forecasting trends, with large prediction intervals, if a point is twice as high as an already high point (and hitting 1000 PPM is exceptionally high), it will not improve our forecast of the pattern. Regardless of if the value is 400 or 1000, that is high enough to put out a critical air quality alert, and keeping the points at 1000 will only hurt our forecast. Along with this, it has been shown in numerous reports that these sensors have issues at high levels, and create errors just like this. A cursory autoplot confirms that these are not due to our interpolation but actual measured data. So, we will fix the negative values and shrink the exceptional ones so our whole series is on the same, useful scale:

library(pipeR)
uvar <- beijing$PM_US %>>% 
  abs %>>% tsclean

# Save our work to the hash table
china[['uvar']] <- uvar

First Steps: Raw Data Scatterplot

autoplot(uvar) + theme_hc() + labs(y = "PM2.5")
Preprocessed and Cleaned Time Series

Preprocessed and Cleaned Time Series

There we go, now we finally have a time series we can work with. Lets go ahead and discuss what we can see in this plot. First, we can see with a great deal of certainty theres a long term seasonal pattern, potentially yearly. Lets try to zoom in a bit to see if we can find a weekly and daily pattern, just using the raw data:

Scanning for Daily Seasonality

We will first cut our time series and pick a random week or two of data, and then plot it, drawing lines at midnight every day.

uvarShort <- window(uvar, 
                    start = c(4), 
                    end = c(4,7*48))
uvarShort <- ts(uvarShort, frequency = 24)
plot(uvarShort, ylab = "PM2.5")
abline(v = 1:14, col = "red")
2 weeks of PM 2.5 measurements

2 weeks of PM 2.5 measurements

What is imortant to look at here is the pattern in between each day (red lines). We see here that there appears to be a peak in the nighttime hours, and a trough in the daytime hours. This pattern is not consistent in scale, but it is pretty consistent in occurence. We can take another look, with a season plot, where a time series is cut into periods, and each period is plotted repeatedly over each other. We will only look at 10 days here.

uvarDays <- ts(uvar,
               frequency = 24, 
               start = c(4, 24*70), 
               end = c(4, 24*80))
ggseasonplot(uvarDays) +  theme_hc()  + scale_color_hc()
Season plot of hourly data over 10 days

Season plot of hourly data over 10 days

In this plot, we are looking at matching and trends lining up. What we see is clear evidence that during the morning and day time, PM 2.5 seems to dip, whereas at nighttime (especially 6 pm to like 4-5 am), we are getting a peak. This is another sign of daily seasonality.

Resampling a Time Series

In order to see the more long term trends in a time series, we are going to need to resample it, to a lower frequency. Otherwise, the data is simply too noisy for us to understand (overplotting). We are going to want to take daily, weekly, and quarterly samples in order to see different trends (just as hourly data helped us see the daily pattern, daily data will help us see a weekly pattern, weekly quarterly or monthly, and quarterly a yearly pattern, and so on). As we are going to do this multiple times, at different frequencies, instead of writing multiple functions by hand, lets write a function that outputs a function, so that we can quickly generate resampling functions on the fly. Before we can do that, lets discuss a plan of attack, and another useful R tool: tapply.

The tapply function

The tapply function is an incredibly powerful function that should be in everyones toolbelt, as when you need it, it can cut down the number of lines of code you need. tapply allow us to apply a function to subsets of a vector, using another vector as a subsetting tool. In our case, we will subset by the number of times our desired period divides our original time series. To do this we will use the %/% operator.

The %/% Operator

This operator does sort of the opposite of the %% operator: instead of returning the remainder, it returns the number of times the RHS can divide the LHS. As an example, lets say we have 50 points, and want to group them with a period of four. We can then define our grouping vector as follows:

x <- 1:50
#  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
# [27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
y <- (x-1) %/% 4
y
#>  [1]  0  0  0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4  4  4  5  5  5
#> [24]  5  6  6  6  6  7  7  7  7  8  8  8  8  9  9  9  9 10 10 10 10 11 11
#> [47] 11 11 12 12

We see the first 4 values, denoted by 0, represent the first period of our series. The next 4 values, denoted by 1, represent the second period. Note that to get our periods to work out nicely like this, we had to subtract 1 from the original vector. Now, let us resample:

changeSamples <- function(period){
  function(vec) {
    result <- unname(tapply(vec,
                            (seq_along(vec)-1) %/% period, 
                            mean))
    return(
           ts(result, frequency = (8760/period))
    )
  }
}

Next, using our resampling function, lets quickly write a daily, weekly, monthly, and quarterly resampler:

toDays <- changeSamples(24)
toWeeks <- changeSamples(24 * 7)
toMonths  <- changeSamples(24 * 7 * 4)
toQuarters <- changeSamples(24 * 365 / 4)

Working with Lists of Functions

A lesser known R trick, when you have bunch of functions you want to run over a dataset, is to use a list of functions, and lapply them over the dataset:

resamplers <- list(
                   daily =  toDays,
                   weekly = toWeeks,
                   monthly = toMonths,
                   quarterly = toQuarters
)

resamples <- lapply(resamplers, function(f) f(uvar))
data.frame(t(sapply(resamples, frequency))) 

This monthly sampling is a little off, but otherwise this is a very useful set of data we now have, and we can go on with our EDA.

Seasonal Plots En Masse

Instead of painstakingly writing code to make each plot, now that we are familiar with the general structure, I can suggest two ways of plotting our seasonal plots: a base R way with lapply, and a clean way with walk. I have strong opinions against purrr as a whole, however, walk, which applies a function to an object but only returns the functions “side effects” is incredibly useful. Lets show an example with our seasonal plot:

themedSeason <- function(ts) {
  p <- ggseasonplot(ts) + 
    scale_color_hc() + 
    theme_hc() 
  print(p)
}

walk(resamples, themedSeason)

We instantly produced four well themed plots. Now, instead of this, we may want to plot them in a grid, and with that we can use another incredibly useful function cowplot::plot_grid:

seasonPlots <- lapply(resamples, function(x) {
                        ggseasonplot(x) + scale_color_hc() + theme_hc()})
cowplot::plot_grid(plotlist = seasonPlots)
Daily, Weekly, Monthly, and Quarterly Seasonal Plots

Daily, Weekly, Monthly, and Quarterly Seasonal Plots

Interpretation

What have we learned from these seasonal plots?

Well, we havent learned everything, but we have learned one important thing: There is a clear yearly trend. Especially looking at the quarterly lot, but really in each plot (except the monthly plot, which is a mess), we see that the average PM2.5 content is more or less the same each year, as the shape of each year lines up.

Detecting a Weekly Trend

To detect the weekly trend, lets revisit our base R plot with lines on each period trick. We will use our daily data, and take maybe 10 or so weeks (as at some point we start to overplot). These weeks were chosen completely at random:

uvarWeeks <- window(resamples$daily, 
                    start = c(5, 30*7), 
                    end = c(5, 40*7))
uvarWeeks <- ts(uvarWeeks, frequency = 7)
plot(uvarWeeks)
abline(v = 1:26, col = "red")

This is a very clear example of a weekly seasonality, there is a peak in the middle of each week. This makes sense, as there is a lot more driving during the work week.

ACF

In time series, another very imortant tool is the ACF, and more specifically the ACF plot. What this shows, is how the current observation is correlated with the observation \(\ell\) timesteps, or lags ago. Lets check it out on each sampling of our data:

acf(uvar, lag.max = 8760)
ACF: Hourly Data

ACF: Hourly Data

This tells us a ton about the nature of our data. On short time scales (think hours), our data exhibits explosive growth. What this means is that the value at the present time is highly positively correlated with what occured recently. Looking at our previous plots, this behavior is readily visible. What is very interesting however, is the strong negative correlation region at about half a year. This means that values now are highly negatively correlated with what happened about 6 months ago. Referring back to our quarterly season plot, we can very clearly see this behavior. What this plot also tells us, with its extreme complexity and very high ACFs overall, is this data is going to be incredibly difficult to analyze through classical methods, as theres an inherently complex and likely multiseasonal autocorrelation structure.

ACFs at other sampling rates:

par(mfrow = c(2,2))
walk(resamples, function(x) acf(x, lag.max = frequency(x)))

We see the same behavior demonstrated nicely here. There is a clear, strong, yearly seasonality, which will be quite a challenge for us to deal with, as there are other more subtle seasonalities which we were able to pick up on with our other plots.

The Parzen Window

The final plot which I use frequently is the smoothed periodogram, in this case specifically the parzen window. What it shows basically is at what frequencies a lot of our data seems to be. Lets start out with a parzen plot of our original data:

library(tswge)
parzen <- parzen.wge(uvar)
Parzen window of hourly data

Parzen window of hourly data

What does this mean? Well, first, we see the overall shape of the graph: it peaks on the left, and decays going to the right. This is indicative of strong low frequency patterns in the data. This also, on its own, indicates a lot of times a model with a clear trend (ARIMA). However, we know from our previous EDA that there is not much of an up or down trend here, so this peak at zero could be indiciative of a very long seasonal pattern, or a high order AR component. These other peaks all along the time series are typically evidence of an ARIMA model with seasonality (such as the classic airline data). Again, this data does not appear to have any integrated (ARIMA) order or root on the unit circle (If you dont know what this means, stay tuned for next time). So this in our case could be a sign of a long seasonality, combined with some shorter seasonalities. Lets one more sample plot to confirm the presence of a yearly seasonality:

quarters <- resamples$quarterly
parzenq <- parzen.wge(quarters)

We see a peak at 0.25, indiciative of a period of 4 observations, or in this case quarters. This means again we have a yearly trend in our series

Conclusions

We have detected probably 3 seasonalities in our data: daily, weekly, and yearly. By resampling our data, we were able to eaily observe these patterns. Next, we need to think about what sort of models we want to use. Given the complexity of the data, and the presence of an exceptionally long (and multiple) seasonal patterns, we are likely going to have a hard time to find appropriate pure ARIMA and other classical models (which we will do next time). We will likely need to consider moving to more interesting fourier based models in the future.

We do not see any evidence of a non-seasonal trend in our data, and as far as stationarity (constant mean, constant variance, constant autocorrelation structure), we have clearly violated it. Therefore, a model with a minimal of a seasonal ARIMA model is going to be required. We will next time explore classical univariate models, as well as determining model appropriacy. Stay tuned!

Saving our work:

We updated our “china” hash table, so we are going to go ahead and update it on our disk too (to include this uvar object):

saveRDS(china, file = 'store/china.Rds')

Thanks for reading!

David Josephs

2019-08-20